很多人总是问我如何挖掘TCGA的数据,发文章!
可是他却连TCGA的数据是怎么来的都不知道,TCGA发了几十篇CNS大文章(自己测序的)了,每篇文章都有几百个左右的癌症样本的6种数据,这几年凑成了一万多个样本,都放在GDC里面可以根据权限下载。同时也出来了十几篇TCGA的数据挖掘大文章(主要包括亚型,driver mutation,假基因等新型研究领域)
那么一篇标准的一个标准的TCGA大文章应该自己测哪些数据?
其实稍微仔细浏览几篇文章就明白了,套路也是存在的,https://tcga-data.nci.nih.gov/docs/publications/ (本人已经写爬虫把所有TCGA在CNS的大文章的PDF及附件全部下载,请后台回复TCGA大文章获取!)
我们就以2013年发表在新英格兰杂志上面的Genomic and Epigenomic Landscapes of Adult De Novo Acute Myeloid Leukemia 为例子吧!
研究的是acute myeloid leukemia (AML),在医院花个十年时间精心挑选了200 adults with de novo AML ,当然病人详细信息是要给的,还要符合伦理,签知情协议书吧。
We performed whole-genome sequencing of the primary tumor and matched normal skin samples from 50 patients (with data from 24 of these patients reported previously17) and exome capture and sequencing for another 150 paired samples of AML tumor and skin (see Table S3 in the Supplementary Appendix for coverage data for the 200 samples).
全基因组测序毕竟贵,就只测50个吧,当然,癌症样本要取癌旁配对研究才有意义。剩余的就做外显子吧,毕竟便宜一点!
We performed RNA-expression profiling on the Affymetrix U133 Plus 2 platform for 197 samples, RNA sequencing for 179 samples, microRNA (miRNA) sequencing for 194 samples, Illumina Infinium HumanMethylation450 BeadChip profiling for 192 samples, and Affymetrix SNP Array 6.0 for both tumor and normal skin samples from all 200 patients.
接着就是芯片和测序的mRNA表达数据,然后是测序的miRNA表达就是,然后是芯片的甲基化数据,和芯片的拷贝数变异检测数据。
Data sets were not completed for all samples on all platforms because of assay failures and availability and quality issues for some samples. The complete list of data sets is provided in Table S4 in the Supplementary Appendix. All data sets are available through the Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov/tcga).
这么多数据都给TCGA贡献出来了,不发大文章,就没天理了。
至于怎么分析,在现在我们看来,就是一些套路了。(当然,没有两把刀估计连套路都看不懂)
但是这些数据,他们一个组分析肯定只能是挑重点说咯,所以TCGA数据挖掘首先就是可以捡人家剩下的,然后可以把多个癌种合起来分析。
下载这些数据对很多人来说是比较困难的事情,我后面要是有空会讲一讲,大家请持续关注。
虽然在TCGA中直接下载数据的方法较为繁琐,但是有多个网站提供TCGA数据(包括表达和临床等)完善的整理:GDAC, Cancer Browser和cBioportal是其中整理最为完整和可靠的。GDAC由美国MIT和Harvard共建的Broadinstitute运行,UCSC运行着Cancer Browser 和Xena, cBioportal由MemorialSloan-Kettering Cancer Cente建立,提供较为完善的TCGA数据为基础的各类信息检索服务。
再后面全是英文,我懒得翻译,我估计你也懒得看了,赶快去我们微信公众号后台回复TCGA大文章吧,获取全部PDF文章及附件,也分享给你的好友!
还附送TCGA的所有youtube的topic视频,关爱一下困难地区的小朋友!
Values for disease include "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "COADREAD", "DLBC", "ESCA", "FPPP", "GBM", "GBMLGG", "HNSC", "KICH", "KIPAN", "KIRC", "KIRP", "LAML", "LGG", "LIHC", "LUAD", "LUSC", "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD", "TGCT", "THCA", "THYM", "UCEC", "UCS", and "UVM".
数据类型有:"RNASeq2", "RNASeq", "miRNASeq", "CNA_SNP", "CNV_SNP", "CNA_CGH", "Methylation", "Mutation", "mRNA_Array", and "miRNA_Array".
详细解释如下:
The type parameter should only be used along with these data.type parameters:
• RNASeq - "count" for raw read counts (default);"RPKM" for normalized read counts (reads per kilobase per million mapped reads).
• miRNASeq - "count" for raw read counts (default); "rpmmm" for normalized read counts.
• Mutation - "somatic" for non-silent somatic mutations (default); "all" for all mutations. • Methylation - "27K" platform (default); "450K" platform. • CNA_CGH - "415K" for CGH Custom Microarray 2x415K (default); "244A" for CGH Microarray.
• mRNA_Array - "G450" for Agilent 244K Custom Gene Expression G4502A (default); "U133" for Affymetrix Human Genome U133A 2.0 Array; "Huex" for Affymetrix Human Exon 1.0 ST Array.
The Level III RNA-Seq, miRNA-Seq, mRNA-array, and miRNA-array data imported are at gene level, but not the mutation, copy number alterations/variation (CNA/CNV), and methylation data.
就先说到这里吧!